ABSTRACT

Meta learning uses information from base learners (e.g. classifiers 
or estimators) as well as information about the learning problem to 
improve upon the performance of a single base learner. For example, 
the Bayes error rate of a given feature space, if known, can be used to 
aid in choosing a classifier, as well as in feature selection and model 
selection for the base classifiers and the meta classifier. Recent work 
in the field of f-divergence functional estimation has led to the
development of simple and rapidly converging estimators that can be
used to estimate various bounds on the Bayes error. We estimate
multiple bounds on the Bayes error using an estimator that applies
meta learning to slowly converging plug-in estimators to obtain the
parametric convergence rate. We compare the estimated bounds
empirically on simulated data and then estimate the tighter bounds on
features extracted from an image patch analysis of sunspot continuum
and magnetogram images.

Index Terms: Bayes error, divergence estimation, meta learning,
classification, sunspots

1. INTRODUCTION 

Meta learning is a method of learning from learned knowledge that 
can be used to improve the performance of various learning tasks
[1], [2]. In a typical example where the learning task is classification,
meta learning is applied by first training multiple classifiers on the 
training data. Each classifier may use either all of the training data, 
or only a subset which may differ from other subsets in the feature 
space. A test set is then fed into these classifiers and the resulting 
output is then used as input to train an overall meta classifier such as 
a majority vote or weighted majority vote. Other variations on meta 
learning applied to classification exist CHS- 

Meta learning can incorporate information about the feature 
space that is independent of the classifiers such as the Bayes error 
rate (BER). Consider the problem of classifying a feature vector $x$
into one of two classes $C_1$ or $C_2$. Denote the a priori class
probabilities as $q_1 = \Pr(C_1) > 0$ and $q_2 = \Pr(C_2) = 1 - q_1 > 0$.
The conditional densities of $x$ given that $x$ belongs to $C_1$ or
$C_2$ are denoted by $f_1(x)$ and $f_2(x)$, respectively, and the Bayes
classifier assigns $x$ to $C_1$ if and only if $q_1 f_1(x) > q_2 f_2(x)$. If
$p(x) = q_1 f_1(x) + q_2 f_2(x)$, the average error rate of this classifier,
known as the BER, is

$$P_e^* = \int \min\left(\Pr(C_1 \mid x), \Pr(C_2 \mid x)\right) p(x)\,dx = \int \min\left(q_1 f_1(x), q_2 f_2(x)\right) dx. \tag{1}$$
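
As a concrete check of Eq. (1), the short sketch below evaluates the BER numerically for two one-dimensional, unit-variance Gaussian class densities and compares it with the closed form $\Phi(-|\mu_1 - \mu_2|/2)$ that holds when $q_1 = q_2 = 1/2$. The Gaussian setting, the function names, and the integration grid are illustrative assumptions and are not part of the method described here.

```python
import math


def normal_pdf(x, mean):
    """Density of a unit-variance Gaussian with the given mean."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def bayes_error_numeric(q1, mean1, mean2, lo=-12.0, hi=14.0, n=100_000):
    """Midpoint-rule approximation of Eq. (1) in one dimension."""
    q2 = 1.0 - q1
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += min(q1 * normal_pdf(x, mean1), q2 * normal_pdf(x, mean2))
    return total * h


def bayes_error_equal_priors(mean1, mean2):
    """Closed form Phi(-delta / 2), valid only for q1 = q2 = 1/2."""
    delta = abs(mean1 - mean2)
    return 0.5 * math.erfc(delta / (2.0 * math.sqrt(2.0)))
```

With equal priors the two functions should agree to within the integration error; with unequal priors only the numerical version applies.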

The BER is the minimum classification error rate that can be 
achieved by any classifier on x’s feature space 0. Because of this 
property, the BER can be used in a meta learning problem where the 
base classifiers are trained on different feature spaces by weighting 
the output of the base classifiers based on the Bayes error of the
underlying feature space. If a given feature space results in a lower
Bayes error than another feature space, then the output of the
corresponding classifier would have a higher weight as it would
presumably perform better than a classifier on the alternate feature space.

The BER can be used at other stages of meta learning such as 
in the selection of the base classifiers and model selection. This is 
because the BER provides a benchmark for classification on a given 
feature space. If a specific classifier applied to the feature space 
yields an estimated error rate that is significantly above the BER,
then it is likely that a different classifier or parameters may result in a 
lower error rate. On the other hand, if the classifier’s estimated error 
rate is below the BER, then the classifier is likely to be overfitting 
the data and may not generalize well to new samples from the feature 
space. A different classifier or parameters may then be chosen. This 
technique can also be applied in the traditional supervised learning 
approach where a single classifier is used. 

The BER can also be beneficial for feature selection in
classification problems. The BER is monotonic in the number of
features in the sense that increasing the number of features does not
decrease the accuracy of the Bayes classifier. However, for many
classifiers, including irrelevant features can decrease the prediction
accuracy 1112). Including a large number of features can also be
computationally burdensome and create difficulties in storage and
memory mm. Thus from a practical perspective, using only a subset
of the features may result in better performance. If the BER is
known for all subsets of features, then a logical method of feature
selection would be to choose the smallest subset of features such that
the BER of that subset is negligibly larger than the BER of the full
feature space 00). The eliminated features could be considered
redundant or irrelevant since including them in the classification leads
to a negligible improvement in accuracy.
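
The subset rule described above can be sketched as follows, assuming the BER (or an estimate of it) is already available for every candidate subset. The dictionary `ber_by_subset`, the function name, and the tolerance `tol` that stands in for "negligibly larger" are hypothetical choices made for illustration.

```python
def select_feature_subset(ber_by_subset, full_set, tol=0.01):
    """Return the smallest subset whose BER is within tol of the full set's BER.

    ber_by_subset maps frozenset(feature names) -> BER (or a BER estimate).
    Ties in size are broken by lower BER, then by sorted feature names.
    """
    full_ber = ber_by_subset[frozenset(full_set)]
    candidates = [
        subset for subset, ber in ber_by_subset.items() if ber - full_ber <= tol
    ]
    return min(
        candidates,
        key=lambda s: (len(s), ber_by_subset[s], sorted(s)),
    )
```

In practice the BER values would be replaced by estimated bounds, and an exhaustive table over all subsets is only feasible for a small number of features.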

Unfortunately, computing the BER requires perfect knowledge
of the underlying data distributions, which is rarely available.
Even for parametric models of the densities, Eq. (1) requires
multidimensional integration and has no closed form solution for many
models. Evaluating the BER in these cases involves computationally
intensive numerical integration, especially for high dimensions. For
these reasons, many feature selection algorithms have focused on
other optimality criteria such as minimizing the prediction error of
a specific classifier [7] or maximizing the statistical dependency
between the feature subset and class assignments via some criterion
such as mutual information or correlation tm However, selecting
features by minimizing the classifier prediction error can be
computationally intensive and only provides a solution for the specified
classifier. Additionally, other methods based on maximizing
statistical dependency can be too restrictive or otherwise problematic ED-

Given these problems, many bounds on the BER have been derived
that are related to f-divergences [13]–[16]. These bounds have
been used in applications involving the BER including feature
selection Q3EM-

Accurate estimation of these bounds on the BER requires accurate
estimation of an f-divergence functional, often in a nonparametric
setting. Until recently, little has been known about the properties
of nonparametric f-divergence estimators such as convergence
rates and the asymptotic distribution. In Moon and Hero [21], it
was shown that the bias of simple density plug-in estimators of
f-divergence converges very slowly to zero when the dimension of the
feature space is high, which limits their utility. Nguyen et al. [22]
proposed an f-divergence estimation method based on estimating the
likelihood ratio of two densities that achieves the parametric mean
squared error (MSE) convergence rate when the densities are
sufficiently smooth. However, this method can be computationally
intensive for large sample sizes and the asymptotic distribution of the
estimator is currently unknown. Berisha et al. D3 also proposed
a consistent estimator of specific bounds on the BER based on the
construction of a minimal spanning tree (MST) that does not require
density estimation. However, the convergence rate of this estimator
is unknown and it is restricted to specific BER bounds instead of
f-divergences in general. Finally, other f-divergence estimators have
been proposed that achieve the parametric rate when the densities
are sufficiently smooth min. However, some of these estimators
are restricted to certain subsets of f-divergences, and they require an
optimal kernel which can be difficult to implement and compute.

Many of these problems can be countered effectively by using
meta learning. While meta learning was described above in the
classification setting, it can also be applied to estimation to improve the
convergence rates. This is typically done by taking a weighted sum
of base estimators that individually converge slowly. Then by an
appropriate choice of weights, the weighted ensemble estimator
converges rapidly to the true value. For example, Sricharan et al. [26]
derived a nonparametric estimator of generalized entropy functionals
that converges at the parametric rate by using simple plug-in
density estimators as the base estimators. More recently, similar theory
was applied by Moon and Hero [21], [27] to obtain a nonparametric
f-divergence functional estimator based on a weighted ensemble of
k-nearest neighbor (k-nn) estimators. This estimator enjoys the
advantages of being simple to implement and achieving the parametric
convergence rate when the densities are sufficiently smooth.

In this paper, we focus on estimating multiple bounds on the
Bayes error derived from f-divergence functionals in a nonparametric
setting using the weighted k-nn estimator from [21], [27]. We first
estimate the bounds on simulated data where the true BER is
computable. This gives a guide for the empirical utility of each bound.
We then apply this to real data by estimating the bounds on the BER
for the classification of sunspot images using the features derived
in [28]. This gives a measure of the utility of the derived feature
space in this supervised setting. We also compare the results to
those obtained using the MST estimator 03- The paper is outlined
as follows. Section 2 describes the weighted k-nn estimator
of f-divergence functionals while Section 3 provides the bounds on
the Bayes error and their relation to f-divergences. In Section 4 the
simulated results are presented. Section 5 describes the sunspot data
and presents the estimated bounds on the BER. Section 6 concludes.


2. META LEARNING OF F-DIVERGENCE FUNCTIONALS


If $f_1$ and $f_2$ are $d$-dimensional probability densities with common
support, then the f-divergence between $f_1$ and $f_2$ has the following
form [29]:

$$D_\phi(f_1, f_2) = \int \phi\left(\frac{f_1(x)}{f_2(x)}\right) f_2(x)\,dx. \tag{2}$$

For $D_\phi$ to be considered a true divergence, the function $\phi$ must be
convex and $\phi(1) = 0$. This ensures that $D_\phi$ is nonnegative and
that $D_\phi(f_1, f_2) = 0$ if and only if $f_1 = f_2$, which is the definition
of divergence. As for general divergences, f-divergences are not
required to be symmetric or satisfy the triangle inequality.

In this work, we are concerned with a broader class of functions
that we call f-divergence functionals. This class consists of
functions of the form in Eq. (2) except that we do not require $\phi$ to be
convex or that $\phi(1) = 0$. Working with f-divergence functionals
instead of only f-divergences provides greater flexibility in bounding
the BER.

Assume that the densities $f_1$ and $f_2$ have a common bounded
support set $S$; $f_1$ and $f_2$ are strictly lower bounded; and $f_1$, $f_2$,
and $\phi$ are smooth. Assume that $T = N + M$ i.i.d. realizations
$\mathcal{X}_T = \{X_1, \dots, X_N, X_{N+1}, \dots, X_{N+M}\}$ are available from the
density $f_2$ and $M$ i.i.d. realizations $\mathcal{Y}_M = \{Y_1, \dots, Y_M\}$ are
available from the density $f_1$, where $M$ is proportional to $T$. Under
these assumptions, there exists a nonparametric estimator of $D_\phi$
that achieves the parametric MSE rate of $O(1/T)$. This estimator
first calculates an ensemble of k-nn density estimators of the
densities $f_1$ and $f_2$ at the points $\{X_1, \dots, X_N\}$ using different values
of $k$. Then for each $k$, a base plug-in estimator of $D_\phi$ is calculated
by taking the empirical average of $\phi$ evaluated at the likelihood
ratio of the estimated densities. From HD. the bias and variance of
these base estimators are known. Then using the theory of optimally
weighted ensemble estimation [26], an estimator with low bias can
be obtained by taking a weighted sum of the base estimators using
the appropriate weights. Details are given in the following.

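Let $k < M$ and let $\rho_{2,k}(i)$ be the distance of the $k$th nearest
neighbor of $X_i$ in $\{X_{N+1}, \dots, X_{N+M}\}$. Similarly, let $\rho_{1,k}(i)$ be
the distance of the $k$th nearest neighbor of $X_i$ in $\{Y_1, \dots, Y_M\}$.
Then the k-nn density estimates at the point $X_i$ are [30]

$$\hat{f}_{j,k}(X_i) = \frac{k}{M c\, \rho_{j,k}^d(i)}, \quad j = 1, 2,$$

where $c$ is the volume of a $d$-dimensional unit ball. The functional
is then approximated as

$$\hat{D}_{\phi,k} = \frac{1}{N} \sum_{i=1}^{N} \phi\left(\frac{\hat{f}_{1,k}(X_i)}{\hat{f}_{2,k}(X_i)}\right).$$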

Now choose an ensemble of positive real numbers $\bar{\ell} = \{\ell_1, \dots, \ell_L\}$
where $L > d - 1$ and let $k(\ell) = \ell\sqrt{M}$. The bias and variance of
$\hat{D}_{\phi,k(\ell)}$ were derived in ED; in particular, the variance is

$$\mathrm{Var}\left(\hat{D}_{\phi,k(\ell)}\right) = O\left(\frac{1}{N} + \frac{1}{M}\right).$$
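
A minimal sketch of the base plug-in estimator described above, written in plain Python: k-nn density estimates of the form $k / (M c \rho^d)$ are computed at each evaluation point and $\phi$ is averaged over the estimated likelihood ratios. The function names are hypothetical, both reference samples are assumed to have the same size $M$, and no ensemble weighting is applied here.

```python
import math


def unit_ball_volume(d):
    """Volume c of the d-dimensional unit ball."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)


def kth_nn_distance(x, points, k):
    """Euclidean distance from x to its k-th nearest neighbor in points."""
    return sorted(math.dist(x, p) for p in points)[k - 1]


def knn_density(x, points, k):
    """k-nn density estimate k / (M * c * rho^d) at the point x."""
    d = len(x)
    rho = kth_nn_distance(x, points, k)
    return k / (len(points) * unit_ball_volume(d) * rho ** d)


def plug_in_divergence(phi, eval_points, f2_points, f1_points, k):
    """Average of phi(f1_hat / f2_hat) over the evaluation points.

    eval_points and f2_points play the roles of {X_1..X_N} and
    {X_{N+1}..X_{N+M}}; f1_points plays the role of {Y_1..Y_M}.
    """
    total = 0.0
    for x in eval_points:
        ratio = knn_density(x, f1_points, k) / knn_density(x, f2_points, k)
        total += phi(ratio)
    return total / len(eval_points)
```

Points are represented as tuples of floats. The evaluation points are kept separate from the reference points so that the k-th neighbor distance is never zero for distinct samples.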


Let $w$ be a vector of weights with length $L$ and define
$\hat{D}_{\phi,w} := \sum_{\ell \in \bar{\ell}} w(\ell) \hat{D}_{\phi,k(\ell)}$. From the theory of optimally
weighted ensemble estimation [26], there exists a weight vector $w_0$
such that the MSE of $\hat{D}_{\phi,w_0}$ is $O(1/T)$. The weight vector $w_0$
achieves this by essentially zeroing out the lower order bias terms at
the expense of a slight increase in the variance. $w_0$ can be found via
an offline convex optimization problem that only depends on the
sample size $T$ and the basis functions of the bias terms. See [21],
[26], [27] for more details.
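
To illustrate how a weighted sum can cancel bias terms, the toy sketch below computes minimum-norm weights that sum to one and are orthogonal to a given set of basis functions. The minimum-norm criterion, the small linear solver, and the placeholder basis functions $\ell$ and $\ell^2$ used in the usage note are assumptions chosen for illustration; they stand in for, and are not, the offline convex optimization and basis functions of the cited ensemble theory.

```python
def solve_linear(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def ensemble_weights(params, basis_functions):
    """Minimum-norm w with sum(w) = 1 and sum(w * psi(l)) = 0 for each basis psi."""
    # First row enforces sum(w) = 1; each further row zeroes out one bias term.
    rows = [[1.0] * len(params)]
    rows += [[psi(l) for l in params] for psi in basis_functions]
    target = [1.0] + [0.0] * len(basis_functions)
    # Minimum-norm solution of rows @ w = target: w = rows^T (rows rows^T)^{-1} target.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in rows] for ri in rows]
    lam = solve_linear(gram, target)
    return [
        sum(lam[i] * rows[i][j] for i in range(len(rows)))
        for j in range(len(params))
    ]
```

For example, if each base estimate equals the true value plus bias terms proportional to $\ell$ and $\ell^2$, then `ensemble_weights([1.0, 2.0, 3.0, 4.0], [lambda l: l, lambda l: l * l])` yields weights whose weighted sum of the base estimates recovers the true value up to rounding.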


3. BOUNDS ON THE BAYES ERROR RATE 


Multiple upper and lower bounds on the BER related to f-divergences
exist. A classical bound is the Chernoff bound Da. It is derived
from the fact that for $a, b > 0$, $\min(a, b) \le a^\alpha b^{1-\alpha}$ for all
$\alpha \in (0, 1)$. Replacing the minimum function in Eq. (1) with this
bound gives

$$P_e^* \le q_1^\alpha q_2^{1-\alpha} c_\alpha(f_1, f_2), \tag{3}$$

where $c_\alpha(f_1, f_2) = \int f_1^\alpha(x) f_2^{1-\alpha}(x)\,dx$ is the Chernoff $\alpha$-coefficient.
The Chernoff coefficient is found by minimizing the right hand side
of Eq. (3) with respect to $\alpha$:

$$c^*(f_1, f_2) = \min_{\alpha \in (0,1)} \int f_1^\alpha(x) f_2^{1-\alpha}(x)\,dx. \tag{4}$$

Combining this with Eq. (3) gives an upper bound on the BER.
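
For two unit-variance Gaussians whose means are a distance $\Delta$ apart, the Chernoff $\alpha$-coefficient has the closed form $c_\alpha = \exp(-\alpha(1-\alpha)\Delta^2/2)$, which makes the bound easy to evaluate. The sketch below scans a grid of $\alpha$ values and compares the resulting bound with the exact BER for equal priors. The Gaussian setting and the function names are assumptions made for illustration. The code minimizes the full right hand side of the bound over the grid, which for equal priors coincides with minimizing $c_\alpha$ alone.

```python
import math


def chernoff_coefficient_gaussian(alpha, delta):
    """Closed-form c_alpha for unit-variance Gaussians with mean separation delta."""
    return math.exp(-0.5 * alpha * (1.0 - alpha) * delta * delta)


def chernoff_upper_bound(q1, delta, alphas=None):
    """Smallest value of q1^a * q2^(1-a) * c_a over a grid of a in (0, 1)."""
    q2 = 1.0 - q1
    if alphas is None:
        alphas = [i / 100.0 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
    return min(
        q1 ** a * q2 ** (1.0 - a) * chernoff_coefficient_gaussian(a, delta)
        for a in alphas
    )


def bayes_error_equal_priors(delta):
    """Exact BER Phi(-delta / 2) for unit-variance Gaussians with q1 = q2 = 1/2."""
    return 0.5 * math.erfc(abs(delta) / (2.0 * math.sqrt(2.0)))
```

For $\Delta = 2$ and equal priors the minimum is attained at $\alpha = 0.5$, giving $0.5\,e^{-1/2}$, which is noticeably larger than the exact BER and illustrates the looseness of the bound.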

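In general, the Chernoff bound is not very tight. A tighter bound
was presented in 03- Consider the following quantity:

$$D_{q_1}(f_1, f_2) = 1 - 4 q_1 q_2 \int \frac{f_1(x) f_2(x)}{q_1 f_1(x) + q_2 f_2(x)}\,dx \tag{5}$$

$$\phantom{D_{q_1}(f_1, f_2)} = \int \frac{\left(q_1 f_1(x) - q_2 f_2(x)\right)^2}{q_1 f_1(x) + q_2 f_2(x)}\,dx. \tag{6}$$

It was shown in DU that the BER $P_e^*$ is bounded above and below
as follows:

$$\frac{1}{2} - \frac{1}{2}\sqrt{D_{q_1}(f_1, f_2)} \le P_e^* \le \frac{1}{2} - \frac{1}{2} D_{q_1}(f_1, f_2).$$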
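
The sketch below evaluates these bounds numerically for two one-dimensional, unit-variance Gaussians. It assumes that $D_{q_1}$ takes the form $1 - 4 q_1 q_2 \int f_1 f_2 / (q_1 f_1 + q_2 f_2)\,dx$; the function names, the Gaussian setting, and the integration grid are likewise illustrative assumptions.

```python
import math


def normal_pdf(x, mean):
    """Density of a unit-variance Gaussian with the given mean."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def dq1_bounds(q1, mean1, mean2, lo=-12.0, hi=14.0, n=50_000):
    """Return (lower bound, numerical BER, upper bound) via the midpoint rule."""
    q2 = 1.0 - q1
    h = (hi - lo) / n
    overlap = 0.0  # accumulates q1 q2 f1 f2 / (q1 f1 + q2 f2)
    ber = 0.0      # accumulates min(q1 f1, q2 f2)
    for i in range(n):
        x = lo + (i + 0.5) * h
        a = q1 * normal_pdf(x, mean1)
        b = q2 * normal_pdf(x, mean2)
        if a + b > 0.0:
            overlap += a * b / (a + b)
        ber += min(a, b)
    d = 1.0 - 4.0 * overlap * h
    return 0.5 - 0.5 * math.sqrt(d), ber * h, 0.5 - 0.5 * d
```

The returned triple should be ordered, with the numerically integrated BER lying between the two bounds for any choice of prior and means.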

Arbitrarily tight upper and lower bounds to the BER were given
in ED We consider only the lower bound here. Define

$$g_\alpha(f_1, f_2) = \ln\left(\frac{1 + e^{-\alpha}}{e^{-\alpha q_1 f_1(x)/p(x)} + e^{-\alpha q_2 f_2(x)/p(x)}}\right),$$

where $p(x) = q_1 f_1(x) + q_2 f_2(x)$ as before and $\alpha > 0$. Then the
BER is bounded below as

$$P_e^* \ge \frac{1}{\alpha} \int g_\alpha(f_1, f_2)\, p(x)\,dx =: G_\alpha(f_1, f_2). \tag{7}$$
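
A numerical illustration of this lower bound for two one-dimensional, unit-variance Gaussians is sketched below. It assumes the integrand $g_\alpha = \ln\big((1 + e^{-\alpha}) / (e^{-\alpha \pi_1} + e^{-\alpha \pi_2})\big)$ with posteriors $\pi_i = q_i f_i(x)/p(x)$, and it rewrites the logarithm so that every exponent is nonpositive and the computation stays stable for large values such as $\alpha = 500$. Function names and the integration grid are hypothetical.

```python
import math


def normal_pdf(x, mean):
    """Density of a unit-variance Gaussian with the given mean."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def g_alpha_lower_bound(q1, mean1, mean2, alpha=500.0, lo=-12.0, hi=14.0, n=50_000):
    """Midpoint-rule value of (1/alpha) * integral of g_alpha(x) p(x) dx."""
    q2 = 1.0 - q1
    h = (hi - lo) / n
    log_norm = math.log1p(math.exp(-alpha))  # ln(1 + e^{-alpha})
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        a = q1 * normal_pdf(x, mean1)
        b = q2 * normal_pdf(x, mean2)
        p = a + b
        if p == 0.0:
            continue
        m = min(a, b) / p  # smaller posterior, at most 1/2
        # ln(e^{-alpha m} + e^{-alpha (1 - m)}) = -alpha m + ln(1 + e^{-alpha (1 - 2m)})
        g = log_norm + alpha * m - math.log1p(math.exp(-alpha * (1.0 - 2.0 * m)))
        total += g * p
    return total * h / alpha
```

Under this assumed form the pointwise gap to $\min(\pi_1, \pi_2)$ is at most $\ln(2)/\alpha$, so for $\alpha = 500$ the result should sit just below the exact BER.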

The functionals in Eqs. (3) and (5)–(7) all contain the form in Eq. (2).
To see this, note that for the Chernoff $\alpha$-coefficient, $\phi(t) = t^\alpha$.
For the $D_{q_1}$ based bounds, the functions are more complicated with
$\phi(t) = 1 - \frac{4 q_1 q_2 t}{q_1 t + q_2}$ and $\phi(t) = \frac{(q_1 t - q_2)^2}{q_1 t + q_2}$ for Eqs. (5) and (6), respectively.
The functions are even more complex for Eq. (7). However, if
$t = f_1(x)/f_2(x)$, then $q_1 f_1(x)/p(x) = q_1 t/(q_1 t + q_2)$,
$q_2 f_2(x)/p(x) = q_2/(q_1 t + q_2)$, and $p(x) = f_2(x)(q_1 t + q_2)$.
Substituting these expressions into $G_\alpha(f_1, f_2)$ gives the required
form. Thus we can use the optimally weighted ensemble divergence
estimator from Section 2 to estimate all of these bounds on the Bayes
error. To estimate $c^*(f_1, f_2)$, we estimate $c_\alpha(f_1, f_2)$ for multiple
values of $\alpha$ (e.g. 0.01, 0.02, ..., 0.99) and choose the minimum.


4. SIMULATIONS


In addition to the weighted k-nn estimator, we use an alternate
estimator for $D_{q_1}$ based on an extension of the Friedman-Rafsky (FR)
multivariate two sample test statistic for comparison [31]. This
estimator is derived from the MST of the combined data set
$\mathcal{X}_T \cup \mathcal{Y}_M$ and does not require direct estimation of the densities
$f_1$ and $f_2$ (HUSH. However, the convergence rate and asymptotic
distribution of this estimator are currently unknown.
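
A rough sketch of an MST-based estimate of this kind is given below. It builds the Euclidean MST of the pooled sample with Prim's algorithm, counts the edges $R$ that join points from different samples, and returns $1 - 2R/(N_1 + N_2)$. That formula is an assumption based on the Henze-Penrose limit of the FR statistic with class priors taken as the sample proportions; it may differ in detail from the estimator used in the cited work, and the function names are hypothetical.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) index pairs."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                dist = math.dist(points[u], points[v])
                if dist < best_dist[v]:
                    best_dist[v] = dist
                    best_from[v] = u
    return edges


def fr_divergence_estimate(sample1, sample2):
    """Assumed FR-type estimate: 1 - 2 R / (N1 + N2), R = number of cross-sample MST edges."""
    points = list(sample1) + list(sample2)
    labels = [0] * len(sample1) + [1] * len(sample2)
    cross = sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])
    return 1.0 - 2.0 * cross / len(points)
```

The quadratic-time Prim loop is adequate for small samples; for thousands of points a dedicated MST routine would be preferable. Well separated samples produce a single cross-sample edge, so the estimate approaches one as the sample size grows.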


Fig. 1. Estimated bounds on the Bayes error rate for two unit
variance Gaussians with dimension $d = 5$, varying sample sizes
($T = 5000$, $50$), and varying means over 200 trials. Error bars
correspond to a single standard deviation. The $D_{q_1}$ based lower bounds
are close to the actual Bayes error for both the large and small
sample regimes but have much higher variance with a smaller sample size.
The arbitrarily tight lower bound ($G_\alpha$ with $\alpha = 500$) is very close
to the Bayes error when $T = 5000$ and when the Bayes error is low.


To compare the estimation performance of the various bounds
on the BER, we consider 200 trials of two samples from two
Gaussian distributions with unit variance and varying mean. In practice,
we use a leave one out approach for the weighted k-nn estimator
and so the number of samples from both distributions is equal to
$T$. In the first experiment, we fix the dimension $d = 5$ and vary
the number of samples from each distribution. Figure 1 shows the
cases where $T = 5000$ and $50$. We choose $\alpha = 500$ for $G_\alpha$. In
the large sample regime, the bounds vary smoothly as the separation
between the means of the distributions increases. The two methods
for estimating $D_{q_1}$ have nearly identical results when Eq. (5) is used
for the weighted k-nn method. If Eq. (6) is used, then the estimated
bounds (not shown) are inaccurate. This underscores the importance
of using an appropriate representation of the function $\phi$ when using
plug-in based estimation methods as numerical errors may lead to
varying results.

In the low sample regime, the estimates have much higher variance
and are more biased as the lower bounds often cross the Bayes
error. However, the $D_{q_1}$ based lower bounds are still fairly close to
the true BER and are thus valuable for assessing the potential
performance of a given feature space. Increasing the sample size to as
little as 150 greatly improves the performance (not shown).


Fig. 2. Estimated bounds on the Bayes error rate for two unit
variance Gaussians with varying dimension ($d = 1$, $10$) and a fixed
sample size of $T = 1000$ over 200 trials. The estimated $D_{q_1}$ based
bounds are more biased and have higher variance when the dimension
is higher.

In the second experiment, we fixed the number of samples at
$T = 1000$ and varied the dimension. The results for $d = 1$ and $10$
are given in Fig. 2. In the higher dimension, the $D_{q_1}$ lower bounds
are closer to the BER which results in these estimates crossing over
the BER more often. The variance in all of the estimates is also
higher when $d = 10$.

Several trends are apparent in both Figs. 1 and 2. One is that the
variance of the $D_{q_1}$ lower bounds decreases as the BER decreases.
In general, the MST-based estimator has higher variance than the k-nn
estimator except when the dimension or number of samples is high
(e.g. $d = 10$ or $T = 5000$). The larger variance at high BER is not a
substantial problem as an accurate estimate of the BER is less useful
at higher values. This is because if the BER is around 0.4, then the
feature space being considered does not improve the classification
much beyond random guessing. Thus time and energy may be better
spent on finding a new feature space for the problem instead of
attempting to achieve the BER on the given feature space.

Another observation is that for $d > 1$, the $G_\alpha$ based lower bound
is not tight for higher BER when using $\alpha = 500$. Increasing $\alpha$ does
not substantially improve the tightness at these values due to
numerical precision errors. However, it may be possible to manipulate the
expression for $g_\alpha$ so that this is not an issue.

Overall, these results suggest that estimating the $D_{q_1}$ lower
bound provides a value that is fairly close to the true BER. The
weighted k-nn estimator appears to have lower variance than the MST
based estimator except when the dimension or number of samples
is sufficiently high. Thus we recommend using the $D_{q_1}$ bounds
to estimate the location of the BER. If this gives a range for the
BER that is low (approximately less than 0.2) and there are enough
samples, then $G_\alpha$ may be estimated for a more precise estimate of
the BER. Similar results are obtained for truncated Gaussians.

5. BOUNDING THE BAYES ERROR OF SUNSPOT IMAGES 

We estimate bounds on the BER of a sunspot image classification
problem. Sunspots (SS) are dark areas seen in white light images of
the Sun. They correspond to regions of locally enhanced magnetic
field, as can be seen on magnetograms. SS groups are commonly
classified using the Mount Wilson classification scheme, which
categorizes them by eye based on their morphological features in
continuum (white light intensity) and magnetogram (magnetic field value)
images. Several studies have shown that major solar eruptive events
are strongly correlated with complex SS groups (designated as $\beta\gamma$
or $\beta\gamma\delta$ groups) and less so with simple SSs ($\alpha$ or $\beta$ groups) [33], [34].

Recent work has focused on clustering SSs using an image patch
analysis of continuum and magnetogram images and by applying
dictionary learning on the collection of patches [28], [35]. Two main
approaches were used in [28]. In the first approach, a dictionary is
learned for each SS image pair. The pairwise difference between
these dictionaries is calculated by comparing the subspaces spanned
by the dictionaries using the Grassmannian projection metric. These
pairwise distances are then fed into a clustering algorithm. For the
second approach, a single dictionary is learned from the combined
collection of image patches from all SS image pairs. The dictionary
coefficients corresponding to a single SS image pair are treated as
samples from a distribution. The pairwise distances between these
collections of coefficient samples are calculated by estimating the
Hellinger distance of the underlying distributions and these distances
are then fed into a clustering algorithm.

The resulting clusterings from these two approaches were found
to be correlated somewhat with the Mount Wilson classification
scheme. In this work, we estimate the ability of the associated
feature spaces of these two approaches to classify a SS as ‘complex’
or ‘simple’ by estimating bounds on the Bayes error. We do this by
estimating both the lower and upper bounds formed from $D_{q_1}$ using
both the weighted k-nn and MST estimators for the Grassmannian
approach from the pairwise distances. Bootstrapping is used on the
weighted k-nn estimators to calculate confidence intervals. For the
Hellinger distances, we only use the MST estimator as the k-nn
density estimator is not easily defined in the space of probability
distributions.
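
The bootstrap variant is not specified here, so the sketch below assumes a simple percentile bootstrap: resample the data with replacement, recompute the estimate, and take empirical quantiles. The names `bootstrap_ci` and `estimator` are hypothetical, and any estimator that maps a sample to a number could be passed in.

```python
import random


def bootstrap_ci(samples, estimator, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for estimator(samples)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)
    stats = sorted(
        estimator([samples[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo_idx = int((1.0 - level) / 2.0 * (n_boot - 1))
    hi_idx = int((1.0 + level) / 2.0 * (n_boot - 1))
    return stats[lo_idx], stats[hi_idx]
```

For a two-sample bound estimator, one would resample each class separately and pass both resampled sets to the estimator; the single-sample form above keeps the idea compact.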

We use the same image pairs as in [28] except we exclude the
$\alpha$ groups. This is to keep the number of simple and complex
image pairs roughly the same (192 and 182, respectively). As in [28],
we consider two types of areas: the area within the sunspot and the
area near the corresponding neutral line as determined from
magnetogram images. The morphology of both of these areas is taken
into account in the Mount Wilson classification. The two metrics,
Grassmannian and Hellinger distance, are applied within these areas
separately and a weighted average is taken of the two distances. For
example, if $D_{G,n}$ is the distance matrix comparing the dictionaries
learned from each SS’s neutral line using the Grassmannian metric,
and if $D_{G,s}$ is the distance matrix comparing the dictionaries learned
from within the sunspots, then define $D_G(r) = r D_{G,n} + (1 - r) D_{G,s}$
with $0 \le r \le 1$. The distance matrix $D_G(r)$ is then used
to estimate the bounds on the Bayes error for a variety of weights.
For comparison, we calculate the error rate of a support vector
machine (SVM) classifier with a Gaussian kernel using 10-fold cross
validation to select the parameters.
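
The weighted combination of the two distance matrices can be sketched as follows. The function name and the nested-list representation of the matrices are assumptions made for illustration.

```python
def combine_distance_matrices(d_neutral, d_sunspot, r):
    """Return r * d_neutral + (1 - r) * d_sunspot, entry by entry."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return [
        [r * dn + (1.0 - r) * ds for dn, ds in zip(row_n, row_s)]
        for row_n, row_s in zip(d_neutral, d_sunspot)
    ]
```

Setting `r = 0` returns the within-sunspot distances and `r = 1` returns the neutral line distances, matching the two extremes of the weight.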

Two dictionary learning methods are used: the singular value
decomposition (SVD) and nonnegative matrix factorization (NMF).
Figure 3 shows the estimated bounds when using SVD. Several
patterns are apparent in the results. Both the estimated bounds and the
SVM error rate generally increase as the weight $r$ increases when
the Grassmannian metric on individual dictionaries is used. This
indicates that the dictionaries extracted from within the sunspots
are more relevant to this classification problem than the
dictionaries from the neutral line. The opposite occurs when the Hellinger
distance is used on the dictionary coefficients. In this case, the
estimated bounds and SVM error rate are generally lower when the
weight $r$ favors the neutral line data. Strong spatial gradients in
the magnetogram along the neutral line are often associated with
complex SSs. Since the learned dictionaries contain patches with
magnetogram gradients (see Figs. 4 and 5 in Moon et al. [28]), the
distributions of the corresponding coefficients within the neutral line
may be useful for distinguishing between complex and simple active
regions (ARs) and thus lead to the decreased bounds on the BER and
improved classification.

The NMF results are not shown, but similar trends are observed.
For both the Grassmannian and Hellinger metrics, the estimated
bounds and the SVM error rate generally decrease as the weight $r$
increases, suggesting that the neutral line is better suited for this
classification problem than the data from within the sunspots when
using NMF dictionaries. However, the estimated bounds, confidence
intervals, and error rates are generally still high (>0.25).

In general, these results indicate that if the goal is to accurately
classify SSs into complex or simple SSs based on the Mount Wilson
definition, then additional or different features are required. The
dictionary features may still be relevant for other learning tasks such as
predicting and detecting solar eruptive events.

Fig. 3. $D_{q_1}$-based upper (plain line) and lower (dashed line) bounds
on the Bayes error when classifying sunspot groups as simple or
complex for a variety of weights compared to the error from an SVM
classifier using SVD dictionaries. A weight of $r = 0$ corresponds
to using only the data from within the sunspots while $r = 1$
corresponds to using only the neutral line data. Confidence intervals on
the weighted k-nn estimators are calculated via bootstrapping. The
area around the neutral line and the area within the sunspots give
better results when using the Hellinger and Grassmannian metrics,
respectively.

6. CONCLUSION

Applying meta learning or ensemble methods to the problem of
estimating f-divergence functionals results in more accurate estimates.
This ensemble estimator is useful for estimating multiple bounds on
the Bayes error rate. By simulation, we found that the $D_{q_1}$ bounds
are more accurate than the Chernoff bound and the $G_\alpha$ bound in the
sense that they are tighter for all values of the BER. The $G_\alpha$ bound,
however, is closer to the BER when it is small and when the
dimension is low. The MST and weighted k-nn estimators had similar
performance, suggesting that the MST based method may converge
rapidly to the true value in at least some circumstances.

From the BER bounds of the sunspot data, we found that learned
SVD dictionaries from the neutral line are unlikely to be helpful in
classifying SSs (as either a simple SS or complex SS) based on the
Mount Wilson definition. However, including the dictionary
coefficients from the neutral line does seem to result in lower bounds on
the BER and better classification performance than when just using
the dictionary coefficients from within the sunspots. Overall,
additional features are likely necessary to achieve accurate classification
of sunspots into these categories.